1 Fast detection of communication patterns in distributed executions
Thomas Kunz, Michiel F. H. Seuren 

November 1997 Proceedings of the 1997 conference of the Centre for Advanced Studies 
on Collaborative research 

Full text available: pdf(4.21 MB) Additional Information: full citation, abstract, references, index terms

Understanding distributed applications is a tedious and difficult task. Visualizations based on 
process-time diagrams are often used to obtain a better understanding of the execution of 
the application. The visualization tool we use is Poet, an event tracer developed at the 
University of Waterloo. However, these diagrams are often very complex and do not provide 
the user with the desired overview of the application. In our experience, such tools display 
repeated occurrences of non-trivial commun ... 

2 The true lift model: a novel data mining approach to response modeling in database
marketing

Victor S. Y. Lo 

December 2002 ACM SIGKDD Explorations Newsletter, volume 4 issue 2 

Full text available: pdf(119.81 KB) Additional Information: full citation, abstract, references



In database marketing, data mining has been used extensively to find the optimal customer 
targets so as to maximize return on investment. In particular, using marketing campaign 
data, models are typically developed to identify characteristics of customers who are most 
likely to respond. While these models are helpful in identifying the likely responders, they 
may be targeting customers who have decided to take the desirable action or not 
regardless of whether they receive the campaign contact (e ... 

Keywords: customer development, customer relationship management, data mining, 
database marketing, interaction effect, knowledge discovery, predictive modeling, response 
modeling, treatment effect, true lift, upselling and cross-selling 



3 Extracting predicates from mining models for efficient query evaluation
Surajit Chaudhuri, Vivek Narasayya, Sunita Sarawagi 

September 2004 ACM Transactions on Database Systems (TODS), volume 29 issue 3 
Full text available: pdf(698.37 KB) Additional Information: full citation, abstract, references, index terms



Modern relational database systems are beginning to support ad hoc queries on mining 
models. In this article, we explore novel techniques for optimizing queries that contain
predicates on the results of application of mining models to relational data. For such 
queries, we use the internal structure of the mining model to automatically derive 
traditional database predicates. We present algorithms for deriving such predicates for a 
large class of popular discrete mining models: decision trees, nai ... 

Keywords: Complex predicate optimization, simpler rules from complex predictive 
functions 
4 Towards on-line analytical mining in large databases
Jiawei Han 

March 1998 ACM SIGMOD Record, volume 27 issue 1

Full text available: pdf(387.04 KB) Additional Information: full citation, abstract, citings, index terms

Great efforts have been paid in the Intelligent Database Systems Research Lab for the 
research and development of efficient data mining methods and construction of on-line 
analytical data mining systems. Our work has been focused on the integration of data
mining and OLAP technologies and the development of scalable, integrated, and multiple 
data mining functions. A data mining system, DBMiner, has been developed for interactive 
mining of multiple-level knowledge in large relational databases and ... 

5 Range and kNN query processing for moving objects in grid model

Hae Don Chon, Divyakant Agrawal, Amr El Abbadi 

August 2003 Mobile Networks and Applications, volume 8 issue 4 

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms

With the growing popularity of mobile computing devices and wireless communications, 
managing dynamically changing information about moving objects is becoming feasible. In 
this paper, we implement a system that manages such information and propose two query 
algorithms: a range query algorithm and a k nearest neighbor algorithm. The range query 
algorithm is combined with an efficient filtering technique which determines if a polyline 
corresponding to the trajectory of a moving object inte ... 

Keywords: k nearest neighbors query, moving objects, range query 



6 Reports from KDD-2001 : KDD Cup 2001 report 

Jie Cheng, Christos Hatzis, Hisashi Hayashi, Mark-A. Krogel, Shinichi Morishita, David Page, 
Jun Sese 

January 2002 ACM SIGKDD Explorations Newsletter, volume 3 issue 2 

Full text available: pdf(1.96 MB) Additional Information: full citation, abstract, references, citings

This paper presents results and lessons from KDD Cup 2001. KDD Cup 2001 focused on 
mining biological databases. It involved three cutting-edge tasks related to drug design and 
genomics. 

Keywords: Competition, biology, drug design, genomics 



7 Models and languages for parallel computation
David B. Skillicorn, Domenico Talia 

June 1998 ACM Computing Surveys (CSUR), volume 30 issue 2 

Full text available: pdf(298.05 KB) Additional Information: full citation, abstract, references, citings, index terms

We survey parallel programming models and languages using six criteria to assess their
suitability for realistic portable parallel programming. We argue that an ideal model should
be easy to program, should have a software development methodology, should be
architecture-independent, should be easy to understand, should guarantee performance, 
and should provide accurate information about the cost of programs. These criteria reflect 
our belief that developments in parallelism must be driven b ... 

Keywords: general-purpose parallel computation, logic programming languages, object- 
oriented languages, parallel programming languages, parallel programming models, 
software development methods, taxonomy 



8 Articles on microarray data mining: Statistical methods for joint data mining of gene 
expression and DNA sequence database 
Maria D. Curran, Hong Liu, Fan Long, Nanxiang Ge 
December 2003 ACM SIGKDD Explorations Newsletter, volume 5 issue 2

Full text available: pdf(869.45 KB) Additional Information: full citation, abstract, references

One of the purposes of microarray gene expression experiments is to identify genes 
regulated under specific cellular conditions. With the availability of putative transcription 
factor binding motifs, it is now possible to relate gene expression pattern to the pattern of 
transcription factor binding sites (TFBS), as well as study how TFBS interact with each other 
to control gene expression. The objectives of this study are to develop a systematic 
approach for combining data from microarray gene e ... 

Keywords: T-helper cells, cluster analysis, logistic regression, microarray, modeling,
regulatory motifs, transcription factor binding site (TFBS) 



9 Research sessions: data mining applications: Cost-based labeling of groups of mass
spectra

Lei Chen, Zheng Huang, Raghu Ramakrishnan 

June 2004 Proceedings of the 2004 ACM SIGMOD international conference on 
Management of data 

Full text available: pdf(351.21 KB) Additional Information: full citation, abstract, references

We make two main contributions in this paper. First, we motivate and introduce a novel 
class of data mining problems that arise in labeling a group of mass spectra, specifically for 
analysis of atmospheric aerosols, but with natural applications to market-basket datasets. 
This builds upon other recent work in which we introduced the problem of labeling a single 
spectrum, and is motivated by the advent of a new generation of Aerosol Time-of-Flight 
Spectrometers, which are capable of generating ma ... 



10 Industry/government track papers: Predicting prostate cancer recurrence via
maximizing the concordance index
Lian Yan, David Verbel, Olivier Saidi

August 2004 Proceedings of the 2004 ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(172.87 KB) Additional Information: full citation, abstract, references, index terms

In order to effectively use machine learning algorithms, e.g., neural networks, for the 
analysis of survival data, the correct treatment of censored data is crucial. The concordance 
index (CI) is a typical metric for quantifying the predictive ability of a survival model. We 
propose a new algorithm that directly uses the CI as the objective function to train a model, 
which predicts whether an event will eventually occur or not. Directly optimizing the CI 
allows the model to make complete use of ... 
Keywords: concordance index, neural networks, nomogram, prostate cancer recurrence,
survival analysis 



11 Multi Relational Data Mining (MRDM): Biological applications of multi-relational data
mining

David Page, Mark Craven 

July 2003 ACM SIGKDD Explorations Newsletter, volume 5 issue 1

Full text available: pdf(1.12 MB) Additional Information: full citation, abstract, references, citings

Biological databases contain a wide variety of data types, often with rich relational 
structure. Consequently multi-relational data mining techniques frequently are applied to 
biological data. This paper presents several applications of multi-relational data mining to 
biological data, taking care to cover a broad range of multi-relational data mining 
techniques. 

12 Industry/government track posters: Mining traffic data from probe-car system for travel
time prediction

Takayuki Nakata, Jun-ichi Takeuchi 

August 2004 Proceedings of the 2004 ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(297.21 KB) Additional Information: full citation, abstract, references, index terms

We are developing a technique to predict travel time of a vehicle for an objective road 
section, based on real time traffic data collected through a probe-car system. In the area of 
Intelligent Transport System (ITS), travel time prediction is an important subject. Probe-car 
system is an upcoming data collection method, in which a number of vehicles are used as 
moving sensors to detect actual traffic situation. It can collect data concerning much larger 
area, compared with traditional fixed dete ... 

Keywords: ITS, information criterion, probe-car, time series, travel time 



13 Searching in high-dimensional spaces: Index structures for improving the performance
of multimedia databases

Christian Böhm, Stefan Berchtold, Daniel A. Keim

September 2001 ACM Computing Surveys (CSUR), volume 33 issue 3 

Full text available: pdf(1.39 MB) Additional Information: full citation, abstract, references, citings, index terms

During the last decade, multimedia databases have become increasingly important in many 
application areas such as medicine, CAD, geography, and molecular biology. An important 
research issue in the field of multimedia databases is the content-based retrieval of similar 
multimedia objects such as images, text, and videos. However, in contrast to searching 
data in a relational database, a content-based retrieval requires the search of similar 
objects as a basic functionality of the database system ... 

Keywords: Index structures, indexing high-dimensional data, multimedia databases, 
similarity search 



14 Articles on microarray data mining: Differential expression, class discovery and class 
prediction using S-PLUS and S+ArrayAnalyzer
Michael O'Connell 

December 2003 ACM SIGKDD Explorations Newsletter, Volume 5 issue 2 
Full text available: pdf(958.46 KB) Additional Information: full citation, abstract, references

Microarrays are a powerful experimental platform, allowing simultaneous studies of gene 
expression for thousands of genes under different experimental conditions. However there 
is much biological variability induced throughout the experimental process that can obscure 
the biological signals of interest. As such, the need for experimental design, replication and 
statistical rigor are now widely recognized. Statistical hypothesis testing has become the 
accepted differential expression analysis app ... 

Keywords: S+ArrayAnalyzer, S-PLUS, class discovery, class prediction, differential 
expression 



15 Articles on microarray data mining: Towards interactive exploration of gene expression
patterns

Daxin Jiang, Jian Pei, Aidong Zhang 

December 2003 ACM SIGKDD Explorations Newsletter, volume 5 issue 2 

Full text available: pdf(527.68 KB) Additional Information: full citation, abstract, references

Analyzing coherent gene expression patterns is an important task in bioinformatics research 
and biomedical applications. Recently, various clustering methods have been adapted or 
proposed to identify clusters of co-expressed genes and recognize coherent expression 
patterns as the centroids of the clusters. However, the interpretation of co-expressed genes 
and coherent patterns mainly depends on the domain knowledge, which presents several 
challenges for coherent pattern mining and cannot be solv ... 

16 A survey on wavelet applications in data mining 
Tao Li, Qi Li, Shenghuo Zhu, Mitsunori Ogihara 

December 2002 ACM SIGKDD Explorations Newsletter, volume 4 issue 2
Full text available: pdf(330.06 KB) Additional Information: full citation, abstract, references, citings

Recently there has been significant development in the use of wavelet methods in various 
data mining processes. However, no comprehensive survey has been written
on the topic. The goal of this paper is to fill the void. First, the paper presents a high-level
data-mining framework that reduces the overall process into smaller components. Then
applications of wavelets for each component are reviewed. The paper concludes by
discussing the impact of wavelets on data mining research an ... 

17 Special issue on wireless extensions to the internet: Prediction-based monitoring in 
sensor networks: taking lessons from MPEG 
Samir Goel, Tomasz Imielinski 

October 2001 ACM SIGCOMM Computer Communication Review, volume 31 issue 5 
Full text available: pdf(1.62 MB) Additional Information: full citation, abstract, references

In this paper we discuss the problem of monitoring data sensed in large sensor networks. A 
sensor typically runs on a battery having a limited lifetime. In order to increase the lifetime 
of a sensor it is important that the mechanisms used in monitoring them be energy- 
efficient. In this paper, we propose a new paradigm called Prediction-based monitoring for 
energy-efficient monitoring. We show that the paradigm can be visualized as a watching of 
a "sensor movie" and that concepts from MPEG ma ... 

18 Evolving data mining into solutions for insights: Data-driven evolution of data mining 
algorithms 

Padhraic Smyth, Daryl Pregibon, Christos Faloutsos 
August 2002 Communications of the ACM, volume 45 issue 8 
Full text available: pdf(106.77 KB), html(27.95 KB) Additional Information: full citation, abstract, references, citings, index terms

Fundamentally, these algorithms are driven by the nature of the data being analyzed, in 
both scientific and commercial applications. 

19 Research track: Translation-invariant mixture models for curve clustering 
Darya Chudova, Scott Gaffney, Eric Mjolsness, Padhraic Smyth 
August 2003 Proceedings of the ninth ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(688.59 KB) Additional Information: full citation, abstract, references, index terms

In this paper we present a family of algorithms that can simultaneously align and cluster 
sets of multidimensional curves defined on a discrete time grid. Our approach uses the 
Expectation-Maximization (EM) algorithm to recover both the mean curve shapes for each 
cluster, and the most likely shifts, offsets, and cluster memberships for each curve. We 
demonstrate how Bayesian estimation methods can improve the results for small sample 
sizes by enforcing smoothness in the cluster mean curves. We e ... 

Keywords: EM, alignment, curve clustering, mixture model, transformation invariance 



20 Biclustering Algorithms for Biological Data Analysis: A Survey 
Sara C. Madeira, Arlindo L. Oliveira 

January 2004 IEEE/ACM Transactions on Computational Biology and Bioinformatics 
(TCBB), Volume 1 Issue 1 



Additional Information: full citation



Keywords: Biclustering, simultaneous clustering, coclustering, subspace clustering, 
bidimensional clustering, direct clustering, block clustering, two-way clustering, two-mode 
clustering, two-sided clustering, microarray data analysis, biological data analysis, gene 
expression data. 
21 Searching in metric spaces

Edgar Chávez, Gonzalo Navarro, Ricardo Baeza-Yates, José Luis Marroquín
September 2001 ACM Computing Surveys (CSUR), volume 33 issue 3 

Additional Information: full citation, abstract, references, citings, index terms



The problem of searching the elements of a set that are close to a given query element 
under some similarity criterion has a vast number of applications in many branches of 
computer science, from pattern recognition to textual and multimedia information retrieval. 
We are interested in the rather general case where the similarity criterion defines a metric 
space, instead of the more restricted case of a vector space. Many solutions have been 
proposed in different areas, in many cases without cros ... 



Keywords: Curse of dimensionality, nearest neighbors, similarity searching, vector spaces 



22 Reports from related meetings: Interface '99: a data mining overview 
Arnold Goodman 

January 2000 ACM SIGKDD Explorations Newsletter, volume 1 issue 2
Full text available: pdf(851.62 KB) Additional Information: full citation, abstract, references

This personal overview of Interface '99 is intended to communicate its meaning and 
relevance to SIGKDD, as well as provide valuable information on trends within the Interface 
for data miners seeking to learn more about statistics. In addition, it is the newest link in a 
bridge between the Interface and KDD begun by References 2-4 and the sessions on KDD 
at Interface '98 and Interface '99. 

Keywords: review of Interface'99 conference, statistics 




23 Quantifiable data mining using ratio rules 

Flip Korn, Alexandros Labrinidis, Yannis Kotidis, Christos Faloutsos

February 2000 The VLDB Journal — The International Journal on Very Large Data 

Bases, Volume 8 Issue 3-4
Full text available: pdf(451.80 KB) Additional Information: full citation, abstract, index terms



Association Rule Mining algorithms operate on a data matrix (e.g., customers $\times$ 
products) to derive association rules [AIS93b, SA96]. We propose a new paradigm, namely,
Ratio Rules, which are quantifiable in that we can measure the "goodness" of a set of 
discovered rules. We also propose the "guessing error" as a measure of the "goodness", 
that is, the root-mean-square error of the reconstructed values of the cells of the given 
matrix, when we pre ... 

Keywords: Data mining, Forecasting, Guessing error, Knowledge discovery 



24 Contributed articles on online, interactive, and anytime data mining: MobiMine: 
monitoring the stock market from a PDA 

Hillol Kargupta, Byung-Hoon Park, Sweta Pittie, Lei Liu, Deepali Kushraj, Kakali Sarkar 
January 2002 ACM SIGKDD Explorations Newsletter, volume 3 issue 2 

Full text available: pdf Additional Information: full citation, abstract, references, citings

This paper describes an experimental mobile data mining system that allows intelligent 
monitoring of time-critical financial data from a hand-held PDA. It presents the overall 
system architecture and the philosophy behind the design. It explores one particular aspect 
of the system— automated construction of personalized focus area that calls for user's 
attention. This module works using data mining techniques. The paper describes the data 
mining component of the system that employs a novel Four ... 

25 SPARTAN: a model-based semantic compression system for massive data tables 
Shivnath Babu, Minos Garofalakis, Rajeev Rastogi 

May 2001 ACM SIGMOD Record , Proceedings of the 2001 ACM SIGMOD international 
conference on Management of data, volume 30 issue 2 

Full text available: pdf(240.19 KB) Additional Information: full citation, abstract, references, citings, index terms

While a variety of lossy compression schemes have been developed for certain forms of 
digital data (e.g., images, audio, video), the area of lossy compression techniques for 
arbitrary data tables has been left relatively unexplored. Nevertheless, such techniques are 
clearly motivated by the ever-increasing data collection rates of modern enterprises and the 
need for effective, guaranteed-quality approximate answers to queries over massive 
relational data sets. In this paper, we propose SPA ... 

26 Identifying prospective customers 

Paul B. Chou, Edna Grossman, Dimitrios Gunopulos, Pasumarti Kamesam 
August 2000 Proceedings of the sixth ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(170.89 KB) Additional Information: full citation, references, citings, index terms



Keywords: customer prospecting 

27 Articles on microarray data mining: Supervised analysis when the number of candidate
features (p) greatly exceeds the number of cases (n)
Richard Simon 

December 2003 ACM SIGKDD Explorations Newsletter, volume 5 issue 2 

Full text available: pdf(183.08 KB) Additional Information: full citation, abstract, references

New genomic and proteomic technologies provide measurements of thousands of features 
for each case. This provides a context for enhanced discovery and false discovery. Most 
statistical and machine learning procedures were not developed for the p>>n setting and 
the literature of DNA microarray studies contains many examples of misuse of analytic and
computational methods such as cross-validation. This paper highlights some of the key aspects of
p>>n problems for identifying informative fea ... 

Keywords: classification, cross-validation, prediction 



28 Self-spacial join selectivity estimation using fractal concepts 
Alberto Belussi, Christos Faloutsos 

April 1998 ACM Transactions on Information Systems (TOIS), volume 16 issue 2 

Full text available: pdf(2.28 MB) Additional Information: full citation, abstract, references, citings, index terms

The problem of selectivity estimation for queries of nontraditional databases is still an open 
issue. In this article, we examine the problem of selectivity estimation for some types of 
spatial queries in databases containing real data. We have shown earlier [Faloutsos and 
Kamel 1994] that real point sets typically have a nonuniform distribution, violating 
consistently the uniformity and independence assumptions. Moreover, we demonstrated 
that the theory of ... 

Keywords: fractal dimension, range query, selectivity estimation, spatial join 



29 Industrial/government track: Information awareness: a prospective technical 
assessment 

David Jensen, Matthew Rattigan, Hannah Blau 

August 2003 Proceedings of the ninth ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(987.48 KB) Additional Information: full citation, abstract, references, index terms

Recent proposals to apply data mining systems to problems in law enforcement, national 
security, and fraud detection have attracted both media attention and technical critiques of 
their expected accuracy and impact on privacy. Unfortunately, the majority of technical 
critiques have been based on simplistic assumptions about data, classifiers, inference 
procedures, and the overall architecture of such systems. We consider these critiques in 
detail, and we construct a simulation model that more cl ... 

Keywords: TIA, collective classification, information awareness, iterative classification, 
privacy, ranking classifiers, relational data mining, social network analysis, technology 
assessment 



30 Industry/government track posters: ANN quality diagnostic models for packaging 
manufacturing: an industrial data mining case study 
Nicolas de Abajo, Alberto B. Diez, Vanesa Lobato, Sergio R. Cuesta 
August 2004 Proceedings of the 2004 ACM SIGKDD international conference on
Knowledge discovery and data mining

Full text available: pdf(1.14 MB) Additional Information: full citation, abstract, references, index terms

World steel trade becomes more competitive every day and new high international quality 
standards and productivity levels can only be achieved by applying the latest computational 
technologies. Data driven analysis of complex processes is necessary in many industrial 
applications where analytical modeling is not possible. This paper presents the deployment 
of KDD technology in one real industrial problem: the development of new tinplate quality 
diagnostic models. The electrodeposition of tin on s ...

Keywords: ANNs, CRISP-DM, FMEA, tinplate quality 
31 DBMiner: a system for data mining in relational databases and data warehouses
Jiawei Han, Jenny Y. Chiang, Sonny Chee, Jianping Chen, Qing Chen, Shan Cheng, Wan Gong, 
Micheline Kamber, Krzysztof Koperski, Gang Liu, Yijun Lu, Nebojsa Stefanovic, Lara Winstone, 
Betty B. Xia, Osmar R. Zaiane, Shuhua Zhang, Hua Zhu 

November 1997 Proceedings of the 1997 conference of the Centre for Advanced Studies 
on Collaborative research 

Additional Information: full citation, abstract, references, citings, index terms

A data mining system, DBMiner, has been developed for interactive mining of multiple-level 
knowledge in large relational databases and data warehouses. The system implements a 
wide spectrum of data mining functions, including characterization, comparison, 
association, classification, prediction, and clustering. By incorporating several interesting 
data mining techniques, including OLAP and attribute-oriented induction, statistical analysis, 
progressive deepening for mining multiple-level knowled ... 

32 A unified framework for model-based clustering
Shi Zhong, Joydeep Ghosh 

December 2003 The Journal of Machine Learning Research, volume 4 

Full text available: pdf(851.48 KB) Additional Information: full citation, abstract, index terms

Model-based clustering techniques have been widely used and have shown promising 
results in many applications involving complex data. This paper presents a unified 
framework for probabilistic model-based clustering based on a bipartite graph view of data 
and models that highlights the commonalities and differences among existing model-based 
clustering algorithms. In this view, clusters are represented as probabilistic models in a 
model space that is conceptually separate from the data space. For ... 

33 Index-driven similarity search in metric spaces
Gisli R. Hjaltason, Hanan Samet 

December 2003 ACM Transactions on Database Systems (TODS), volume 28 issue 4 
Full text available: pdf(650.64 KB) Additional Information: full citation, abstract, references, index terms

Similarity search is a very important operation in multimedia databases and other database 
applications involving complex objects, and involves finding objects in a data set S similar 
to a query object q, based on some similarity measure. In this article, we focus on methods 
for similarity search that make the general assumption that similarity is represented with a 
distance metric d. Existing methods for handling similarity search in this setting typically fall 
into one of ... 

Keywords: Hierarchical metric data structures, distance-based indexing, nearest neighbor
queries, range queries, ranking, similarity searching 



34 Poster papers: Tumor cell identification using features rules
Bin Fang, Wynne Hsu, Mong Li Lee 

July 2002 Proceedings of the eighth ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(152.89 KB) Additional Information: full citation, abstract, references, citings, index terms

Advances in imaging techniques have led to large repositories of images. There is an 
increasing demand for automated systems that can analyze complex medical images and 
extract meaningful information for mining patterns. Here, we describe a real-life image 
mining application to the problem of tumour cell counting. The quantitative analysis of 
tumour cells is fundamental to characterizing the activity of tumour cells. Existing
approaches are mostly manual, time-consuming and subjective. Efforts t ... 

Keywords: dynamic water immersion, features rules, identification, local adaptive 
thresholding, majority vote, meta classifier, weighted vote 



35 Papers from MC2R open call: Towards integrated PSEs for wireless communications:
experiences with the S4W and SitePlanner® projects

Roger R. Skidmore, Alex Verstak, Naren Ramakrishnan, Theodore S. Rappaport, Layne T. 
Watson, Jian He, Srinidhi Varadarajan, Clifford A. Shaffer, Jeremy Chen, Kyung Kyoon Bae, 
Jing Jiang, William H. Tranter

April 2004 ACM SIGMOBILE Mobile Computing and Communications Review, volume 8 issue 2

Full text available: pdf(620.32 KB) Additional Information: full citation, abstract, references

This paper describes the computational methodologies of two problem solving environments 
(PSEs) for wireless network design and analysis, one academic (S4W) and one commercial
(SitePlanner®). The PSEs address differently common computational issues such as 
environment specification, propagation modeling, channel performance prediction, system 
design optimization, and data management. The intended uses, interfaces, and capabilities 
of the two PSEs are compared and contrasted in a c ... 

36 Research sessions: data mining: Clustering by pattern similarity in large data sets
Haixun Wang, Wei Wang, Jiong Yang, Philip S. Yu 

June 2002 Proceedings of the 2002 ACM SIGMOD international conference on 
Management of data 

Full text available: pdf(1.09 MB) Additional Information: full citation, abstract, references, citings, index terms

Clustering is the process of grouping a set of objects into classes of similar objects. 
Although definitions of similarity vary from one clustering model to another, in most of 
these models the concept of similarity is based on distances, e.g., Euclidean distance or 
cosine distance. In other words, similar objects are required to have close values on at least 
a set of dimensions. In this paper, we explore a more general type of similarity. Under the 
pCluster model we proposed, two objects ... 

37 Subspace clustering for high dimensional data: a review
Lance Parsons, Ehtesham Haque, Huan Liu 

June 2004 ACM SIGKDD Explorations Newsletter, volume 6 issue 1

Full text available: pdf(539.13 KB) Additional Information: full citation, abstract, references

Subspace clustering is an extension of traditional clustering that seeks to find clusters in 
different subspaces within a dataset. Often in high dimensional data, many dimensions are 
irrelevant and can mask existing clusters in noisy data. Feature selection removes irrelevant 
and redundant dimensions by analyzing the entire dataset. Subspace clustering algorithms 
localize the search for relevant dimensions allowing them to find clusters that exist in 
multiple, possibly overlapping subspaces. The ... 

Keywords: clustering survey, high dimensional data, projected clustering, subspace 
clustering 



38 Bioinformatics — an introduction for computer scientists 
Jacques Cohen 

June 2004 ACM Computing Surveys (CSUR), volume 36 issue 2 

Full text available: pdf(261.56 KB) Additional Information: full citation , abstract , references , index terms 

The article aims to introduce computer scientists to the new field of bioinformatics. This 
area has arisen from the needs of biologists to utilize and help interpret the vast amounts 
of data that are constantly being gathered in genomic research— and its more recent 
counterparts, proteomics and functional genomics. The ultimate goal of bioinformatics is to 
develop in silico models that will complement in vitro and in vivo biological experiments. 
The article provides a bird's eye view of the ... 

Keywords: DNA, Molecular cell biology, RNA and protein structure, alignments, cell 
simulation and modeling, computer, dynamic programming, hidden-Markov-models, 
microarray, parsing biological sequences, phylogenetic trees 



39 Mining scientific data 

Usama Fayyad, David Haussler, Paul Stolorz 

November 1996 Communications of the ACM, volume 39 issue 11 

Full text available: pdf(437.05 KB) Additional Information: full citation , references , citings , index terms 



40 Articles on microarray data mining: Machine learning in low-level microarray analysis 
Benjamin I. P. Rubinstein, Jon McAuliffe, Simon Cawley, Marimuthu Palaniswami, Kotagiri 
Ramamohanarao, Terence P. Speed 

December 2003 ACM SIGKDD Explorations Newsletter, volume 5 issue 2 

Full text available: pdf(382.35 KB) Additional Information: full citation , abstract , references 

Machine learning and data mining have found a multitude of successful applications in 
microarray analysis, with gene clustering and classification of tissue samples being widely 
cited examples. Low-level microarray analysis -- often associated with the pre-processing 
stage within the microarray life-cycle -- has increasingly become an area of active research, 
traditionally involving techniques from classical statistics. This paper explores opportunities 
for the application of machine learning an ... 

Keywords: gene expression estimation, genotyping, incremental learning, learning from 
heterogeneous data, low-level microarray analysis, re-sequencing, semi-supervised 
learning, transcript discovery, transductive learning 

41 Experiments in social data mining: The TopicShop system 

Brian Amento, Loren Terveen, Will Hill, Deborah Hix, Robert Schulman 

March 2003 ACM Transactions on Computer-Human Interaction (TOCHI), volume 10 issue 1 

Full text available: pdf(377.92 KB) Additional Information: full citation , abstract , references , citings , index terms 



Social data mining systems enable people to share opinions and benefit from each other's 
experience. They do this by mining and redistributing information from computational 
records of social activity such as Usenet messages, system usage history, citations, or 
hyperlinks. Some general questions for evaluating such systems are: (1) is the extracted 
information valuable? and (2) do interfaces based on the information improve user task 
performance? We report here on TopicShop, a syst ... 

Keywords: Cocitation analysis, collaborative filtering, computer-supported cooperative 
work, information visualization, social filtering, social network analysis 



42 Data mining: Efficient detection of motion patterns in spatio-temporal data sets 
Joachim Gudmundsson, Marc van Kreveld, Bettina Speckmann 

November 2004 Proceedings of the 12th annual ACM international workshop on 
Geographic information systems 

Full text available: pdf(212.07 KB) Additional Information: full citation , abstract , references , index terms 

Moving point object data can be analyzed through the discovery of patterns. We consider 
the computational efficiency of detecting four such spatio-temporal patterns, namely flock, 
leadership, convergence, and encounter, as defined by Laube et al., 2004. These patterns 
are large enough subgroups of the moving point objects that exhibit similar movement in 
the sense of direction, heading for the same location, and/or proximity. By the use of 
techniques from computational geometry, including app ... 

Keywords: approximation algorithms, geometric algorithms, moving objects, 
spatio-temporal patterns 



43 Multidimensional access methods 
Volker Gaede, Oliver Günther 

June 1998 ACM Computing Surveys (CSUR), volume 30 issue 2 

Full text available: pdf(1.05 MB) Additional Information: full citation , abstract , references , citings , index terms 

Search operations in databases require special support at the physical level. This is true for 
conventional databases as well as spatial databases, where typical search operations 
include the point query (find all objects that contain a given search point) and the region 
query (find all objects that overlap a given search region). More than ten years of spatial 
database research have resulted in a great variety of multidimensional access methods to 
support ... 

Keywords: data structures, multidimensional access methods 



44 From informatics to bioinformatics 

Vladimir B. Bajic, Vladimir Brusic, Jinyan Li, See-Kiong Ng, Limsoon Wong 
January 2003 Proceedings of the First Asia-Pacific bioinformatics conference on 
Bioinformatics 2003 - Volume 19 

Full text available: pdf(538.23 KB) Additional Information: full citation , abstract , references , index terms 

Informatics has helped in launching molecular biology into the genomic era. It appears 
certain that informatics will continue to be a major factor in the success of molecular 
biology in the post-genome era. In this paper, we describe advances made in data 
integration and data mining technologies that are relevant to molecular biology and 
biomedical sciences. In particular, we discuss some past and present research results on 
topics such as (a) the taming of autonomous heterogeneous distributed d ... 

Keywords: Dragon, FIMM, Kleisli, PCL, PIES, bioinformatics, data integration, data 
warehousing, epitope prediction, gene expression analysis, protein interaction extraction, 
transcription start site recognition 



45 Research track posters: Why collective inference improves relational classification 
David Jensen, Jennifer Neville, Brian Gallagher 

August 2004 Proceedings of the 2004 ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(496.13 KB) Additional Information: full citation , abstract , references , index terms 

Procedures for collective inference make simultaneous statistical judgments about the same 
variables for a set of related data instances. For example, collective inference could be used 
to simultaneously classify a set of hyperlinked documents or infer the legitimacy of a set of 
related financial transactions. Several recent studies indicate that collective inference can 
significantly reduce classification error when compared with traditional inference 
techniques. We investigate the under ... 

Keywords: collective inference, probabilistic relational models, relational learning 



46 Research track papers: Cyclic pattern kernels for predictive graph mining 
Tamás Horváth, Thomas Gärtner, Stefan Wrobel 

August 2004 Proceedings of the 2004 ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(291.65 KB) Additional Information: full citation , abstract , references , index terms 

With applications in biology, the world-wide web, and several other areas, mining of 
graph-structured objects has received significant interest recently. One of the major research 
directions in this field is concerned with predictive data mining in graph databases where 
each instance is represented by a graph. Some of the proposed approaches for this task 
rely on the excellent classification performance of support vector machines. To control the 
computational cost of these approaches, the underl ... 

Keywords: computational chemistry, graph mining, kernel methods 



47 Detecting graph-based spatial outliers: algorithms and applications (a summary of 
results) 

Shashi Shekhar, Chang-Tien Lu, Pusheng Zhang 

August 2001 Proceedings of the seventh ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(59a38 KB) Additional Information: full citation , abstract , references , citings , index terms 

Identification of outliers can lead to the discovery of unexpected, interesting, and useful 
knowledge. Existing methods are designed for detecting spatial outliers in multidimensional 
geometric data sets, where a distance metric is available. In this paper, we focus on 
detecting spatial outliers in graph structured data sets. We define statistical tests, analyze 
the statistical foundation underlying our approach, design several fast algorithms to detect 
spatial outliers, and provide a cost model ... 

Keywords: Outlier Detection, Spatial Data Mining, Spatial Graphs 



48 Essential classification rule sets 
Elena Baralis, Silvia Chiusano 

January 2004 ACM Transactions on Database Systems (TODS), volume 29 issue 4 

Full text available: pdf(479.09 KB) Additional Information: full citation , abstract , references , index terms 

Given a class model built from a dataset including labeled data, classification assigns a new 
data object to the appropriate class. In associative classification the class model (i.e., the 
classifier) is a set of association rules. Associative classification is a promising technique for 
the generation of highly accurate classifiers. In this article, we present a compact form 
which encodes without information loss the classification knowledge available in a 
classification rule set. This form incl ... 

Keywords: Association rules, associative classification, concise representations 



49 PocketLens: Toward a personal recommender system 
Bradley N. Miller, Joseph A. Konstan, John Riedl 

July 2004 ACM Transactions on Information Systems (TOIS), volume 22 issue 3 

Full text available: pdf(1.10 MB) Additional Information: full citation , abstract , references , index terms 

Recommender systems using collaborative filtering are a popular technique for reducing 
information overload and finding products to purchase. One limitation of current 
recommenders is that they are not portable. They can only run on large computers 
connected to the Internet. A second limitation is that they require the user to trust the 
owner of the recommender with personal preference data. Personal recommenders hold the 
promise of delivering high quality recommendations on palmtop computers, e ... 

Keywords: Collaborative Filtering, Peer-to-Peer Networking, Privacy, Recommender 
Systems 



50 Learning evaluation functions to improve optimization by local search 
Justin Boyan, Andrew W. Moore 

September 2001 The Journal of Machine Learning Research, volume 1 

Full text available: pdf(643.21 KB) Additional Information: full citation , abstract 

This paper describes algorithms that learn to improve search performance on large-scale 
optimization tasks. The main algorithm, STAGE, works by learning an evaluation function 
that predicts the outcome of a local search algorithm, such as hillclimbing or Walksat, from 
features of states visited during search. The learned evaluation function is then used to bias 
future search trajectories toward better optima on the same problem. Another algorithm, 
X-STAGE, transfers previously learned evaluation ... 

51 Data Mining with optimized two-dimensional association rules 

Takeshi Fukuda, Yasuhiko Morimoto, Shinichi Morishita, Takeshi Tokuyama 
June 2001 ACM Transactions on Database Systems (TODS), volume 26 issue 2 

Full text available: pdf(947.41 KB) Additional Information: full citation , abstract , references , index terms 

We discuss data mining based on association rules for two numeric attributes and one 
Boolean attribute. For example, in a database of bank customers, Age and Balance are two 
numeric attributes, and CardLoan is a Boolean attribute. Taking the pair (Age, Balance) as a 
point in two-dimensional space, we consider an association rule of the form Age,Balance 

Keywords: association rules, convex hull searching, data mining, image segmentation, 
matrix searching 



52 Position papers on MRDM: Prospects and challenges for multi-relational data mining 
Pedro Domingos 

July 2003 ACM SIGKDD Explorations Newsletter, volume 5 issue 1 

Full text available: pdf(397.89 KB) Additional Information: full citation , abstract , references , citings 

This short paper argues that multi-relational data mining has a key role to play in the 
growth of KDD, and briefly surveys some of the main drivers, research problems, and 
opportunities in this emerging field. 

53 Research track papers: Mining, indexing, and querying historical spatiotemporal data 

Nikos Mamoulis, Huiping Cao, George Kollios, Marios Hadjieleftheriou, Yufei Tao, David W. 
Cheung 

August 2004 Proceedings of the 2004 ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(347.95 KB) Additional Information: full citation , abstract , references , index terms 

In many applications that track and analyze spatiotemporal data, movements obey periodic 
patterns; the objects follow the same routes (approximately) over regular time intervals. 
For example, people wake up at the same time and follow more or less the same route to 
their work everyday. The discovery of hidden periodic patterns in spatiotemporal data, 
apart from unveiling important information to the data analyst, can facilitate data 
management substantially. Based on this observation, we propose ... 

Keywords: indexing, pattern mining, spatiotemporal data, trajectories 

54 Special issue on learning from imbalanced datasets: Mining with rarity: a unifying 
framework 
Gary M. Weiss 

June 2004 ACM SIGKDD Explorations Newsletter, volume 6 issue 1 

Full text available: pdf(182.31 KB) Additional Information: full citation , abstract , references , citings 

Rare objects are often of great interest and great value. Until recently, however, rarity has 
not received much attention in the context of data mining. Now, as increasingly complex 
real-world problems are addressed, rarity, and the related problem of imbalanced data, are 
taking center stage. This article discusses the role that rare classes and rare cases play in 
data mining. The problems that can result from these two forms of rarity are described in 
detail, as are methods for addressing these ... 

Keywords: class imbalance, cost-sensitive learning, inductive bias, rare cases, rare 
classes, sampling, small disjuncts 



55 Aggregate operators in probabilistic databases 
Robert Ross, V. S. Subrahmanian, John Grant 
January 2005 Journal of the ACM (JACM), volume 52 issue 1 

Full text available: pdf(816.92 KB) Additional Information: full citation , abstract , references , index terms 

Though extensions to the relational data model have been proposed in order to handle 
probabilistic information, there has been very little work to date on handling aggregate 
operators in such databases. In this article, we present a very general notion of an 
aggregate operator and show how classical aggregation operators (such as COUNT, SUM, 
etc.) as well as statistical operators (such as percentiles, variance, etc.) are special cases of 
this general definition. We devise a formal linear program ... 

Keywords: Aggregates, probabilistic relational databases 



56 Research track posters: A framework for ontology-driven subspace clustering 
Jinze Liu, Wei Wang, Jiong Yang 

August 2004 Proceedings of the 2004 ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(68S .r2 KB) Additional Information: full citation , abstract , references , index terms 

Traditional clustering is a descriptive task that seeks to identify homogeneous groups of 
objects based on the values of their attributes. While domain knowledge is always the best 
way to justify clustering, few clustering algorithms have ever taken domain knowledge into 
consideration. In this paper, the domain knowledge is represented by hierarchical ontology. 
We develop a framework by directly incorporating domain knowledge into clustering 
process, yielding a set of clusters with strong ontolog ... 

Keywords: ontology, subspace clustering, tendency preserving 



57 Research track: Screening and interpreting multi-item associations based on log-linear 
modeling 

Xintao Wu, Daniel Barbara, Yong Ye 

August 2003 Proceedings of the ninth ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(J15.6 7 KB) Additional Information: full citation , abstract , references , index terms 

Association rules have received a lot of attention in the data mining community since their 
introduction. The classical approach to find rules whose items enjoy high support (appear in 
a lot of the transactions in the data set) is, however, filled with shortcomings. It has been 
shown that support can be misleading as an indicator of how interesting the rule is. 
Alternative measures, such as lift, have been proposed. More recently, a paper by 
DuMouchel et al. proposed the use of all-two-factor lo ... 

Keywords: association rule, graphical model, log-linear model 

58 Ensembles and boosting: Predicting rare classes: can boosting make any weak learner 
strong? 

Mahesh V. Joshi, Ramesh C. Agarwal, Vipin Kumar 

July 2002 Proceedings of the eighth ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(1.08 MB) Additional Information: full citation , abstract , references , index terms 

Boosting is a strong ensemble-based learning algorithm with the promise of iteratively 
improving the classification accuracy using any base learner, as long as it satisfies the 
condition of yielding weighted accuracy > 0.5. In this paper, we analyze boosting with 
respect to this basic condition on the base learner, to see if boosting ensures prediction of 
rarely occurring events with high recall and precision. First we show that a base learner can 
satisfy the required condition even for poor ... 



59 Research track papers: A probabilistic framework for semi-supervised clustering 
Sugato Basu, Mikhail Bilenko, Raymond J. Mooney 

August 2004 Proceedings of the 2004 ACM SIGKDD international conference on 
Knowledge discovery and data mining 

Full text available: pdf(187.51 KB) Additional Information: full citation , abstract , references , index terms 

Unsupervised clustering can be significantly improved using supervision in the form of 
pairwise constraints, i.e., pairs of instances labeled as belonging to same or different 
clusters. In recent years, a number of algorithms have been proposed for enhancing 
clustering quality by employing such supervision. Such methods use the constraints to 
either modify the objective function, or to learn the distance measure. We propose a 
probabilistic model for semi-supervised clustering based on Hidden Mar ... 

Keywords: distance metric learning, hidden Markov random fields, semi-supervised 
clustering 



60 Discovering Matrix Attachment Regions (MARs) in genomic databases 
Gautam B. Singh 

January 2000 ACM SIGKDD Explorations Newsletter, volume 1 issue 2 

Full text available: pdf(7 .?°.57 KB) Additional Information: full citation , abstract , references 

Lately, there has been considerable interest in applying Data Mining techniques to scientific 
and data analysis problems in bioinformatics. Data mining research is being fueled by novel 
application areas that are helping the development of newer applied algorithms in the field 
of bioinformatics, an emerging discipline representing the integration of biological and 
information sciences. This is a shift in paradigm from the earlier and the continuing data 
mining efforts in marketing research and s ... 

Keywords: DNA Sequence Analysis, MARs, Matrix Attachment Regions, bioinformatics, 
data mining, gene therapy, medical data mining 